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Abstract 

This paper describes a pre-decoding mechanism for 
superscalar processors, which is necessary to issue 
instructions to plural functional units. Using a pre-decoding 
mechanism, a processor can issue instructions without 
increasing the cycle time or the branch penalty. The pie- 
decoder decides which functional unit should execute an 
instruction during a cache miss and refill. A special tag, 
which we call pre-decode tag, is generated by the pre- 
ricr.nder. This tag is saved with instructions to an instruction 
cache memory- The processor fetches instructions with the 
pre-decode tag and issues instructions controlled by the pre- 
decodc tag. Using this mechanism, fast cycle times, 
com pa nth! e to RISC processors, are attained. 

1 .Introduction 

RISC processors already achieve an instruction-execution 
raie of nearly one instruction per cycle. To attain greater 
performance, several approaches to multiple instruction 
execution per cycle are used. A superscalar approach and a 
VLIW approach are typical ones. Superscalar processors 
take advantage of maintaining scalar code compatibility, 
whereas VLIW processors do not. However, superscalar 
processors need an operand dependency analysis 
mechanism and an instruction issuing mechanism. 

I. ike RISC processors, superscalar processors should be 
designed with a short cycle time and a low branch penalty. 
The register scoreboarding technique is one approach to 
operand" dependency analysis. The Motorola M 88 000 RISC 
processed 1 1 employs this teehniquefthe register 
scoreboarding technique) to execute floating point 
operations having latencies. It analyzes the operand 
dependency and reads a register file in rhe decode stage. 
\js\n« this reuister scoreboarding technique, the time 
required for operand dependency analysis is comparable to 
i ha i of a register read access. Therefore the operand 
dependency analysis mechanism is not an obstacle to 
jiiiiining short cycle nines comparable to conventional RISC 
processors. 

On the other hand, the ins miction issuing process must 
occur before the register read process. Executing the 
instruction issuing process and register read access in one 
cycle becomes a problem when trying to attain the fast cycle 
times. 



The IBM RS6000[2] has a dispatch pipeline stage for 
issuing instructions before the decode stage when it reads 
the register file. Such a pipeline stage strategy increases the 
branch penalty. To compensate this side effect, the RS6000 
uses a zero time branch mechanism. However, to use this 
mechanism effectively, the compiler should move the 
condition setting code instructions far ahead of the 
conditional branch instructions. Therefore, the availability of 
this mechanism is limited. 

Our pre-decoding mechanism is developed for issuing 
instructions without increasing the branch penalty or the 
cycle time. 

2.Pre-decoding Mechanism Policy 

Our policy is to issuing instructions and execute the 
register read access in the decode pipeline stage of one 
cycle and reduce the instruction issuing process in that 
stage. The instruction issuing process is divided into two 
process. One is deciding which functional units should 
execute instructions. The other is sending instructions via a 
buffer to appropriate functional units. We choose the former 
process to be executed in a cache miss and refill process and 
the later process to be executed in the decode pipeline 
stage. Although the instruction cache miss penalty 
increases, such an increase degrades the processor 
performance little. 

3.Pre-decoder 

Figure 1 shows a system using the pre-decoding 
mechanism. A pre-decoder decides which functional unit 
should execute an instruction during a cache miss and refill. 

When a cache miss and refill occurs, instructions are 
transferred from main memory. The pre-decoder generates a 
pre-decode lag by decoding the instructions. The pre- 
decode tag indicates which functional units can execute 
instructions, avoiding functional unit coirflictst. The pre- 
decode tag and the instructions are saved to instruction 
cache memory. After a cache miss and refill is completed, 
the processor fetches the instructions with the pre-decode 



t More than two instructions are issued to a same functional 
unit 
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lr is necessary io dynamically analyze the operand 
dependency when the processor executes instructions, such 
as floating point operations which have latency. The pre- 
decoder cannot analyze the operand dependency, therefore, 
the register scoreboard dynamically analyzes it. 

Figure 2 shows [he fields of the pre-decode tag. If rhe 
processor has N functional units, which can he executed 
simultaneously, the pie-decode tag will have N resource 
fields and (N-1) priority fields. The resource field indicates 
which functional units should execute instructions. The 
priority Field indicates whether the instruction can be 
executed .simultaneously with the previous instruction. If 
there aren't enough functional units which execute 
simultaneous instructions, the priority field will be asserted. 

Figure 3 shows a block diagram of ilk: pre.-decoder. The 
PDC( pre-decode cell) i generates the priority field and the 
resource field of the instruction i. The PDC i sends the 
signals F(iji to the PDC i+l. Most superscalar processors 
have different type of" functional units, for example integer 
■jnits and floating point units, which can be executed 
simultaneously The value j of rhe signals F(i,j) indicates 
the type of functional units which will execute (he instruction 
i. The priority field oi an instruction i is generated by the 
follow mc algorithm: 



F(0,m) = 0. 

M is the number of m-type functional units. 

If an instruction i can be executed by an m-type functional 

unit ,T = F(i-l,m)+ 1. 

If T > M, then F(i,m) = 1 ; the priority field will be asserted. 
If T <= M, then F(i,m) = Tm; the priority field will be 
negated. 

If an instruction i cannot be executed by an m-type 
functional unit, F(i,m) = F(i-1 ,m). 

A resource field code is assigned to each functional unit. If 
the processor has just one functional unit of some type, the 
resource fields of instructions executed by this functional 
unit will simply be generated by decoding the instruction's 
operation code. If the processor has more than two 
functional units of some type, it is necessary to decide which 
functional unit of this type will execute the preceding 
instruction. Corresponding to this policy, the resource fields 
of instructions executed by such functional units are 
generated. For example, if the processor has two integer 
unit.*; A and B, it will be decided that A executes the 
preceding instruction and B executes following one. At first, 
the pre-decoder assigns an integer operation instruction to 
A. If A is already assigned an instruction, the pre-decoder 
will assign an integer operation instruction to B. 
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4.1ns mic t io n Issuing Unit 

Figure 5 shows an instruction issuing unit. Instructioas in 
;m instruction buffer are issued to functional units through a 
cross bar switch circuit, controlled by the pre-decode tag in 
a pre-decorie tag buffer. The functional units, which receive 
issued instructions; decode the instructions and read 
register files. In addition, the instructions in the instruction 
buffer arc analyzed for operand dependency by the register 
scoreboard. The instruction issuing, operand dependency 
anulvsis, decoding and register reading are done in one 
pipeline stage in oderto not increase the branch penalty. 

If the priority fields of an instruction i and an instruction j 
(kj) ate asserted and the priority fields of all instructions k 
(k'k<j) are negated : the cross bar switch circuit will issue 
all instructions m (i<=m<j) simultaneously. If the issued 
instruction has dependency on a previous instruction, the 
register scoreboard will cancel the issuing of that 
instruction, sending a cancel signal to the functional unit 
where the instruction is issued. This instruction is canceled, 
will be issued again by the cross bar switch circuit in the 



next cycle. If the issuing of all instructions ra (i<=m<j) 
completed, then the cross bar switch will issue I (1 >=> j), 

S.Example 





RS 


PF 


(l)add rO,rl,r2 


0 




(2) add r2,r3,r4 


1 


0 


(3) add r5,r6\r6 


0 


1 


(4) fadd fQ&ft 


2 


0 


(5) fmul f4.f5.f6 


3 





RS: resource field PF: priority field 

Assignment of resource fields 
integer unit 1 0 
integer unit 2 1 
floating adder unit 2 
floating multiplier unit 3 

Figure 5. Example Code 



23 



Ri-1,0) 



> ; (i-i,n 



PCiW.2) 



F(i,0) 




FA : Full Adder 
Piqui-e 6. PDC 



Degradation^) 



3.5 
3.0 

2 5 
2.0 



1.5 



D = 0.00 



!.0 



0.5 




- 2.00 



1-0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 
Instruction Cache Miss Ratio(%) 

Figure 7. Perfomance Deeradation 



instructions (1) through (4). The priority field of 
instruction^) is asserted because the processor can only 
execute two integer operations simultaneously. 

After the instruction cache miss and refill is completed, the 
processor fetches instructions (1) through (4) with the pre- 
decode tag. The processor only issues instructions (1) and 
(2) because the priority field of instruction^) is asserted. 
After issuing instructions (1) and (2), the processor issues 
instructions (3) and (4). When all four instructions have 
been issued to functional units, the processor fetches the 
next four instructions. 

devaluation 

Our pre-decode mechanism increases the cache miss and 
refill time. We evaluated this side effect by measuring the 
gate delay time for a pre-decoder of a processor containing 
two ALU, a floating multiply unit and a floating adder unit. 

Figure 6 shows a circuit of a PDC. The pre-decoder can be 
.constructed with four PDC. F(i,0) indicates the number of 
ALU operations. F[i,l) indicates the number of floating add 
operations. F(i,2) indicates the number of floating multiply 
operations. A critical path of this pre-decoder is instruction 
0 to PF3. Using the 0.8 micron CMOS logic, the gate delay 
time of this pre-decoder is 9.8 nsec. This delay time is 
shorter than the CPU cycle time because it takes 10 nsec to 
analyze the operand dependency. 

Figure 7 shows the performance degradation due to the 
pre-decoding mechanism, assuming that the cache refill time 
and the pre-decode time are 10 cycles and 1 cycle, 
respectively. If the instruction cache size is 16 Kbyte, the 
instruction cache miss ratio is 2% according to [3]. If the 
data cache size is 16 Kbyte, then the data cache miss ratio 
is 3% and the performance degradation is 1.3%. This 
degradation is much less than that of increasing the cycle 
time for issuing instructions or having extra stages in the 
decoder. 

7.Conclusion 

Using the pre-decode mechanism, the superscalar 
processors attain faster cycle times, comparable to RISC 
processors, without increasing the branch penalty. The side 
effect of this mechanism is an increase in the instruction 
cache miss and refill time. However, the degradation of the 
performance appears to be low. 
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